WQD7003 - Group 5

22050480 | Lee Tse Lyn
S2190151 | Wee Hin Sheik
17147394 | Fathia Farhana bt Agusalim
S2196123 | Samuel Tan Joo Woon
S2179520 | Khai Yao Wong

Reference Link:

  1. Github Repositories
  2. Web Page

Background: The CRoss Industry Standard Process for Data Mining (CRISP-DM) is applied to study patient admission and discharge patterns in the emergency department at Hero DMC Heart Institute. Two (2) types of data analytics are executed for this project: Descriptive and Predictive Analytics.


1. Business Understanding

There are several pain points experienced in the Hospital Emergency Department, as follows:

Assumptions and limitations: Generally, prioritisation of patient admission is based on the triage scale, which the healthcare community uses to categorise patients by the severity of their injuries, i.e. category 1 (immediate), category 2 (urgent) and category 3 (non-urgent). However, the triage level used to attend to patients during an emergency is excluded from this project's scope, as the patient records in the dataset are categorised into "outpatient" and "emergency" only.

Note: Throughput is defined as the amount of material or items passing through a system or process.

Business Goal: To aid the emergency department's admission and discharge strategy by providing admission predictions.

Data Mining Goal:

Expected Outcome (see Expected outcome.png)

Project Timeline: The project milestones are divided into 5 sprint deliverables, with the data products expected to be deployed in Week 13 as Release 1.0 (see Project Timeline.png).


2. Data Understanding

The dataset is retrieved from an open source, i.e. the Kaggle hospital-admissions-data dataset, which consists of admission and discharge patient records from Hero DMC Heart Institute, India.

There are a total of 15,758 rows and 56 columns in the dataset, comprising categorical and numerical data types (see Column Name.png).

Automated exploratory data analysis

This step is performed to get a quick, high-level overview of the data.
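As a minimal sketch of such an overview (the real dataset is not bundled here, so a tiny hypothetical frame with invented values stands in for it):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Kaggle admissions data (real file not included here)
df = pd.DataFrame({
    "AGE": [62, 71, np.nan, 55],
    "GENDER": ["M", "F", "F", "M"],
    "DURATION OF STAY": [7, 21, 3, np.nan],
})

# Quick high-level overview: shape, dtypes and missing-value counts per column
overview = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing": df.isna().sum(),
    "missing_pct": (df.isna().mean() * 100).round(1),
})
print(df.shape)
print(overview)
```

A profiling library could produce a richer report, but a summary frame like this already surfaces the column types and missing values that drive the cleaning steps below.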


3. Data Cleaning

Based on the Data Understanding results above, data cleaning and preprocessing are performed to resolve issues such as irrelevant information, errors and missing values, and to transform the data so that it is suitable for further analysis.

The data cleaning steps executed are as follows:

  1. Keep 'Emergency' admission records only
  2. Check for missing values
  3. Fill 0 as the default for disease columns
  4. Remove unnecessary columns
    • SNO
    • MRD No.
    • TYPE OF ADMISSION-EMERGENCY/OPD
    • Date
  5. Rename attribute 'SMOKING ' to 'SMOKING'
  6. Convert GLUCOSE, UREA, PLATELETS, TLC, HB, EF, CREATININE and BNP data to standard units, or simply remove the outliers
  7. Replace '/' with 0 in the CHEST INFECTION column
  8. Apply label encoding for machine learning (GENDER, RURAL, OUTCOME*)
  9. Apply normalisation to the numerical attributes: GLUCOSE, UREA, PLATELETS, TLC, HB, EF, CREATININE, BNP
  10. Apply data reduction / dimensionality reduction (PCA), then prepare the data for model training
  11. Split the data into training and testing sets
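The cleaning steps above can be sketched with pandas and scikit-learn on a toy frame (the column names follow the list above; the values are invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Toy frame mimicking a few of the dataset's columns; values are illustrative
df = pd.DataFrame({
    "SNO": [1, 2, 3],
    "MRD No.": ["A1", "A2", "A3"],
    "TYPE OF ADMISSION-EMERGENCY/OPD": ["E", "O", "E"],
    "GENDER": ["M", "F", "M"],
    "SMOKING ": [1, 0, None],
    "CHEST INFECTION": ["1", "0", "/"],
    "GLUCOSE": [90.0, 120.0, 150.0],
})

df = df[df["TYPE OF ADMISSION-EMERGENCY/OPD"] == "E"]                # 1. emergency only
df = df.drop(columns=["SNO", "MRD No.",
                      "TYPE OF ADMISSION-EMERGENCY/OPD"])            # 4. drop unneeded columns
df = df.rename(columns={"SMOKING ": "SMOKING"})                      # 5. strip stray space
df["SMOKING"] = df["SMOKING"].fillna(0)                              # 3. default 0 for disease flags
df["CHEST INFECTION"] = (df["CHEST INFECTION"]
                         .replace("/", "0").astype(int))             # 7. replace '/' with 0
df["GENDER"] = LabelEncoder().fit_transform(df["GENDER"])            # 8. label encoding
df["GLUCOSE"] = MinMaxScaler().fit_transform(df[["GLUCOSE"]])        # 9. normalisation
```

Steps 2, 6, 10 and 11 (missing-value checks, unit conversion, PCA and the split) follow the same pattern and are covered in the sections below.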

Exploratory Data Analysis (EDA)

An EDA is performed to assess the trend of patients admitted during emergency situations and to identify any patterns that may emerge.

From the bar chart above, it can be concluded that the maximum duration of stay for emergency patients in all age groups is more than 20 days, with a maximum stay of more than 90 days (3 months) for patients aged between 60 and 90 years.

It can also be observed that the average duration of stay of emergency patients across all age groups is approximately 7 days.

3. Assess the duration of patients' stay in hospital according to their symptoms/diagnoses, where 0 - no symptom/diagnosis and 1 - has symptom/diagnosis.

4. Assess the correlation between emergency patients' duration of stay and their symptoms/diagnoses.

From the correlation matrix above, the features that correlate significantly (> 0.3) with 'Duration of Stay' occur when an emergency patient is experiencing:
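The correlation screen described in step 4 can be sketched as follows; the feature names and the 0.3 threshold follow the text, while the data itself is a synthetic stand-in:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Synthetic binary symptom flags plus a target correlated with one of them
df = pd.DataFrame({
    "HEART FAILURE": rng.integers(0, 2, n),
    "CHEST INFECTION": rng.integers(0, 2, n),
})
df["DURATION OF STAY"] = 3 + 4 * df["HEART FAILURE"] + rng.normal(0, 1, n)

# Correlation of every feature with the target; keep only |r| > 0.3
corr = df.corr()["DURATION OF STAY"].drop("DURATION OF STAY")
significant = corr[corr.abs() > 0.3]
print(significant)
```

On the real data the same filter over the full correlation matrix yields the feature list referred to above.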

Principal Component Analysis (PCA)

This section applies PCA to reduce the number of attributes. The first part determines the principal component index; the second part applies PCA to the dataset.
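A sketch of the two parts, assuming standardised numeric attributes and an illustrative 90% explained-variance threshold (the report does not state the exact threshold used):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 8))  # stand-in for the numeric attributes

# Part 1: inspect cumulative explained variance to pick the component count
X_std = StandardScaler().fit_transform(X)
pca_full = PCA().fit(X_std)
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
n_components = int(np.searchsorted(cumvar, 0.90) + 1)  # smallest count covering 90%

# Part 2: apply PCA with the chosen component count
X_reduced = PCA(n_components=n_components).fit_transform(X_std)
print(n_components, X_reduced.shape)
```

Standardising before PCA matters here because the raw attributes (GLUCOSE, PLATELETS, BNP, etc.) are on very different scales.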

4. Modelling

The dataset is split 80%/20% into training and testing sets respectively for evaluating model performance. After splitting, the training and testing sets contain 8739 and 2185 rows respectively.
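A sketch of the split with scikit-learn; with 10,924 post-cleaning rows (the two reported set sizes summed) and `test_size=0.2`, `train_test_split` yields exactly the sizes reported:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(10924, 5))  # placeholder features; 10924 = 8739 + 2185
y = rng.normal(size=10924)       # placeholder target (duration of stay)

# 80/20 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 8739 and 2185, matching the report
```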

The following models are evaluated, with metrics including R-squared, mean absolute error, mean squared error and root mean squared error:

(1) Linear Regression

(2) Decision Tree Regression

(3) Random Forest Regression

(4) Support Vector Regression

(5) Bayesian Ridge Regression

(6) Gradient Boosting Regression

(7) Elastic Net Regression

(8) Light Gradient Boosting Machine Regression

(9) Extreme Gradient Boosting Regression

(10) K-Nearest Neighbors Regression

A cross-validation technique is applied to the best model to obtain the optimal hyperparameters/configurations.
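The evaluation loop and cross-validation step can be sketched as follows, shown here for two of the ten models on synthetic data (the dataset and model subset are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the cleaned admissions data
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
}
results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    mse = mean_squared_error(y_test, pred)
    results[name] = {
        "R2": r2_score(y_test, pred),
        "MAE": mean_absolute_error(y_test, pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
    }

# 5-fold cross-validation R-squared for one candidate, illustrating the CV step
cv_r2 = cross_val_score(RandomForestRegressor(random_state=42), X, y,
                        cv=5, scoring="r2")
```

The remaining eight regressors plug into the same `models` dictionary, so each is scored on the identical split and metric set.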

From the results above, the best-performing model for the present problem is Random Forest, with the highest R-squared value (coefficient of determination) and the lowest root mean squared error (RMSE).

However, only about 22.16% of the variation in patients' duration of stay can be explained by the independent variables in the dataset using the Random Forest Regressor.

To improve model performance, we perform grid search with cross-validation on the Random Forest model to obtain the best hyperparameters.

By doing so, the accuracy/loss for every combination of hyperparameters is computed and the combination with the best performance can be selected.
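A sketch of the grid search with cross-validation; the parameter grid here is illustrative, as the report does not list the exact grid used:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the cleaned admissions data
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Illustrative hyperparameter grid for the Random Forest
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      cv=5, scoring="r2")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```

Every grid combination is scored by 5-fold cross-validated R-squared, and `best_params_` holds the winning configuration.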

As presented in the model performance table, the highest R-squared value of all models trained was 0.2216. The low performance of these models might indicate that there is no relationship between the features (x) and patients' duration of stay (y) in the data, or that the relationship between x and y is non-linear.

To identify whether there is a non-linear relationship between x and y, a polynomial regression algorithm is applied to the data and the metric scores are obtained. This is performed by identifying the optimal degree of the independent variables (x) with an elbow plot, and training the model with the best degree obtained.
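A sketch of the degree search on a synthetic non-linear target; the error-versus-degree values computed here are what the elbow plot visualises:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 1))
y = 1.5 * X[:, 0] ** 2 + rng.normal(0, 0.3, 300)  # non-linear ground truth

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit degrees 1..5 and record test error; the elbow is where error stops dropping
errors = {}
for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    errors[degree] = mean_squared_error(y_test, model.predict(X_test))

best_degree = min(errors, key=errors.get)
print(best_degree, round(errors[best_degree], 4))
```

When the underlying relationship is non-linear, the degree-1 (plain linear) error sits well above the higher-degree errors, which is the signal the elbow plot is meant to expose.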

5. Deployment

The notebook is executed with papermill, and its output is then deployed to GitHub Pages for web hosting. Below is the link to the deployed page:

https://samueltan3972.github.io/WQD7003-DataAnalytics.html
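The deployment steps might look roughly like the following; the notebook filename, output names and branch are assumptions, not taken from the repository:

```shell
# Execute the analysis notebook with papermill, render it to HTML,
# then publish via GitHub Pages (names below are illustrative)
papermill analysis.ipynb output.ipynb
jupyter nbconvert --to html output.ipynb --output WQD7003-DataAnalytics.html
git add WQD7003-DataAnalytics.html
git commit -m "Publish Release 1.0"
git push origin gh-pages
```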

6. Project Close Out and Retrospective Learning

As part of project completion, the below deliverables are produced for project close out:

The team also conducted an internal retrospective session to evaluate the overall experience of delivering the project, capturing the key success stories, main challenges and improvements that can be made in the future:

1. What went well

2. What are the main challenges

3. What can be improved